Europäisches Patentamt
European Patent Office
Office européen des brevets

Publication number: 0 508 225 A2

EUROPEAN PATENT APPLICATION
Application number: 92105187.6
Date of filing: 26.03.92
Int. Cl.5: G10L 5/06
Priority: 11.04.91 DE 4111781
Date of publication of application: 14.10.92 Bulletin 92/42
Designated Contracting States: AT BE CH DE DK FR GB LI LU NL SE

Applicant: International Business Machines Corporation
Old Orchard Road
Armonk, N.Y. 10504(US)

Inventor: Bandara, Upali, Dr.
Tinqueux-Allee 17
W-6906 Leimen(DE)

Inventor: Hitzenberger, Ludwig, Dr.
Nadelspitzweg 19
W-8419 Schoenhofen(DE)

Inventor: Keppel, Eric, Dr.
Ladenburger Strasse 23
W-6954 Hirschberg(DE)

Inventor: Mohr, Karlheinz
Egerlandstrasse 5
W-6920 Sinsheim(DE)

Inventor: Schmidt, Rudolf
Odenwaldstrasse 47/4
W-Heidelberg(DE)

Inventor: Walch, Georg, Dr.
Valentinianstrasse 74
W-6802 Ladenburg(DE)

Inventor: Wothke, Klaus, Dr.
Kastellweg 15
W-6900 Heidelberg(DE)

Representative: Schäfer, Wolfgang, Dipl.-Ing.
European Patent Attorney, IBM Deutschland GmbH, Schönaicher Strasse 220
W-7030 Böblingen(DE)



Computer system for speech recognition.



A computer system for speech recognition is described which, using a microphone, an acoustic transducer and a processor, converts spoken text into digitized phonemes. By means of training texts, the phonemes and the words to be recognized by the computer system are stored in a memory means. When a text is subsequently dictated by the speaker, the uttered and the stored phonemes are compared. Using statistical methods and models, the computer system detects those phonemes and, based thereon, the speaker's utterance. There are natural phonemes, meaning actual utterances, and artificial phonemes which are artificially generated by the computer system in the training phase. Thus, an artificial phoneme is provided which is associated with pauses occurring between the components of a compound word. By detecting this interval, the word can be split into its components which are separately processed. Also provided are artificial phonemes resembling at least two natural phonemes. This allows different phoneme sequences of one and the same word to be combined into a single phoneme sequence. By using artificial phonemes, the memory space required is reduced. At the same time, the number of compare steps for the uttered and stored phonemes is reduced, which increases the processing speed of the computer system.

The invention concerns a computer system for speech recognition, comprising means for converting spoken words into sounds (phonemes), means for storing words in a phonetic representation (phonetic baseform), the phonetic baseform including a number of natural phonemes, and means for comparing the converted and the stored phonemes.

Such computer systems for speech recognition are known from the art and are used mostly as automatic dictating machines. The text is dictated by the speaker into a microphone and converted into electric signals. The electric signals are used by the computer system to generate digitized phonemes corresponding to the spoken text. The digitized phonemes are then compared with digitized phonemes previously generated in a training phase and stored in a memory of the computer system. During the training phase, a plurality of words, i.e. their correct letter sequence and the associated speaker-dependent phoneme sequence, are stored in the computer memory. Using probability calculus, the computer system first looks for those phonemes and then for those words in memory which are most likely to match the spoken text. The words thus recognized are finally checked as to textual context and, if necessary, corrected. As a result, the dictated text is recognized by the computer system and may be processed further, for example, by a text processing system, displayed or printed.

A major problem with computer systems used for speech recognition is to store in their memories a vocabulary of the words to be recognized. This problem is particularly pronounced with the German language, amongst others, for two reasons. On the one hand, the German language comprises a plurality of compound words which all have to be stored individually. Thus, for instance, it is necessary to store not only the word "Fahrkartenschalter" but also its components "Fahrkarte", "Karte" and "Schalter", as well as other words including such components, for instance, "Schalterstunde", "Fahrstunde", etc. On the other hand, there are numerous German dialects with greatly varying pronunciations of the same word. As such phonological variations cannot be stored in the computer memory as a single phoneme sequence, it is necessary instead to store several phoneme sequences for one and the same word, one for each dialect. Consequently, a computer system used for speech recognition in the German language has to store a very large number of words. Apart from the fact that the memory space of a computer system is limited, the disadvantage of storing a large number of words is that the comparison of the words dictated by the speaker with those stored in memory is greatly slowed down, so that the computer system is no longer able to "follow" the speaker. The speaker has to "wait" for the computer system to catch up, which in practice limits the usefulness of the computer for speech recognition.

It is the object of the invention to provide a computer system for speech recognition which is sufficiently fast to readily follow the text dictated by a speaker.

According to the invention, this object is accomplished in that the phonetic baseforms of the words in a computer system of the previously described kind also comprise a number of artificial phonemes.

Artificial phonemes are phonemes which the speaker does not utter as such. During the training phase of the computer system, these phonemes are artificially generated by the computer system depending on how particular words are pronounced by the speaker. Such artificial phonemes have the advantage that they allow words to be represented as phonetic baseforms which consume less memory space than phonetic baseforms of natural phonemes. This greatly reduces the memory space needed for storing the vocabulary to be recognized by the computer and the time required by the computer system for recognizing the spoken text. As a result, the computer system is able to "follow" the text dictated by the speaker and perform its recognition tasks in real time.

One embodiment of the invention is characterized in that a first silent phoneme, associated with the pause between two words, is provided as a natural phoneme, and that a second silent phoneme, associated with the pause between two components of a compound word, is provided as an artificial phoneme.

By means of the second, artificial, silent phoneme, the computer system detects the boundary between two components of a compound word. The compound word may be split into its components and each component may be individually processed and recognized by the computer system. Thus, instead of storing the compound word "as a whole" in the computer memory, it is sufficient to store its components. Storing compound words as such is no longer necessary. The advantage of this is that the time required for comparing the words dictated by the speaker with the stored words is greatly reduced.

Another embodiment of the invention provides for an artificial phoneme which resembles at least two natural phonemes. According to this embodiment, a word or a component of a compound word whose phonetic baseform may differ with regard to the sequence of similar natural phonemes in at least two ways is represented by the associated artificial phoneme(s).

Such an artificial phoneme allows phonological variations of one and the same word to be represented as a single phonetic baseform. Therefore, to account for dialectal differences in the pronunciation of the same word, it is no longer necessary to store several phonetic baseforms in the computer memory but only one single baseform comprising artificial in addition to natural phonemes. Multiple storage operations for one and the same word are thus eliminated.

Further embodiments of the invention may be seen from the patent claims as well as from the subsequent description of examples of the invention, which are described with reference to the drawings. The numerical values appearing in the description of the examples have proved advantageous in practice, so that detailed explanations have been omitted.

Fig. 1 is a simplified block diagram of a computer system for speech recognition;

Fig. 2 is a diagram showing an electric speech signal over time;

Fig. 3 is a diagram showing a spectrum of the speech signal of Fig. 2 over frequency;

Fig. 4 is intended to show a multi-dimensional space with dot clusters;

Fig. 5 is a table in which particular label sequences are associated with a number of phonemes, and

Fig. 6 is a table in which particular phoneme sequences are associated with a number of words.
The computer system for speech recognition illustrated in Fig. 1 comprises a microphone 10 which is connected to an acoustic transducer 11. The transducer 11 is linked to a processor 15. Also provided is a memory 13 which is connected to processor 15. Finally, processor 15 is connected to a display unit 17.

In response to a speaker speaking into the microphone 10, an electric amplitude signal AS is generated, which is shown over time t in Fig. 2. This amplitude signal AS is Fourier-transformed, with each time frame of 20 ms duration yielding one Fourier spectrum. Such a Fourier spectrum FT is shown over frequency f in Fig. 3. The Fourier spectrum FT of Fig. 3 is associated with the amplitude signal AS in the first 20 ms time frame shown in Fig. 2. Each Fourier spectrum FT is subdivided into 20 bands, each represented by one value of the associated Fourier spectrum FT. Thus, for each of the 20 ms time frames of the amplitude signal AS there is a vector V with the 20 values W1, W2, ... W19, W20 of the associated Fourier spectrum FT.
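The framing step just described can be illustrated with a minimal sketch. The 20 ms frame length and the 20 bands come from the text; the sampling rate of 8000 Hz, the equal-width bands, the averaging of magnitudes within a band and the name `frame_vectors` are assumptions made only for this illustration and are not taken from the application.

```python
import cmath
import math

FRAME_MS = 20    # frame length stated in the description
NUM_BANDS = 20   # number of bands stated in the description

def frame_vectors(samples, sample_rate=8000):
    """Return one 20-value vector V per complete 20 ms frame of `samples`.

    Assumptions: the sample rate, equal-width bands and band averaging.
    """
    n = sample_rate * FRAME_MS // 1000   # samples per frame
    half = n // 2                        # non-redundant bins of a real signal
    width = half // NUM_BANDS            # spectrum bins per band
    vectors = []
    for start in range(0, len(samples) - n + 1, n):
        frame = samples[start:start + n]
        # Naive discrete Fourier transform: magnitude of bins 0 .. half-1.
        mags = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(half)
        ]
        # One value W1 .. W20 per band: mean magnitude of the band's bins.
        vectors.append([
            sum(mags[b * width:(b + 1) * width]) / width
            for b in range(NUM_BANDS)
        ])
    return vectors
```

Under these assumptions a 40 ms signal yields two vectors of 20 values each. A practical system would use a fast Fourier transform and probably a perceptually motivated band layout; the naive transform is used here only to keep the sketch free of dependencies.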

The axes in Fig. 4 represent a multi-dimensional space. This space, shown as 3-dimensional in the figure, is assumed to be 20-dimensional for the present invention (which cannot be drawn as such). A plurality of dots are entered into this 20-dimensional space. Each dot corresponds to a vector V, with the position of the dot in the 20-dimensional space being defined by the 20 values of the vector V. Thus, the 20-dimensional space of Fig. 4 contains the respective Fourier spectra FT of successive 20 ms time frames of the amplitude signal AS.

It has been found, however, that the dots entered in the 20-dimensional space of Fig. 4 are not uniformly distributed but appear in clusters. These clusters are language-dependent, so that the clusters for the German language differ from those of the French or the English language.

The clusters are numbered consecutively. This is shown in Fig. 4 by each cluster being associated with a label, such as L15 or L147. In this way, 200 clusters are designated by 200 labels.
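How a vector V might be mapped to one of the cluster labels can be sketched as follows. The description does not state the assignment rule; the nearest-centre rule with Euclidean distance, the function names and the toy 3-dimensional centres are assumptions chosen for illustration (the real space has 20 dimensions and 200 labels).

```python
import math

def nearest_label(vector, centroids):
    """Return the label whose cluster centre is closest to `vector`.

    `centroids` maps a label such as "L15" to the centre of its cluster.
    Euclidean distance is an assumption, not taken from the description.
    """
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

def label_sequence(vectors, centroids):
    """Turn the vectors of successive time frames into a label sequence."""
    return [nearest_label(v, centroids) for v in vectors]

# Hypothetical cluster centres, shown in 3 dimensions as in Fig. 4.
toy_centroids = {"L15": [0.0, 0.0, 0.0], "L147": [1.0, 1.0, 1.0]}
toy_labels = label_sequence([[0.1, 0.0, 0.2], [0.9, 1.1, 1.0]], toy_centroids)
```

With these invented centres the first vector falls into cluster L15 and the second into cluster L147.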

Laboratory training ensures that the positions of the clusters in the 20-dimensional space are only language-dependent and substantially speaker-independent. For this training, 10 different speakers utter the various words to be recognized by the computer system. This information is used by the computer to determine the positions of the clusters by statistical methods and models (such as Markov models).

Laboratory training is also used for the following purposes. The words to be recognized by the computer system are stored in memory 13 such that the spelling of a word is associated with a phonetic baseform. This phonetic baseform of a word consists of concatenated phonemes, of which there are 60. Each phoneme is made up of a string of labels which are each associated with predetermined 20 ms time frames of the amplitude signal AS in Fig. 2. If, for example, the amplitude signal AS of Fig. 2 represents a single phoneme, this phoneme is represented with the aid of the Fourier spectrum FT (Fig. 3) by three labels in the 20-dimensional space of Fig. 4. During laboratory training, a substantially speaker-independent label sequence is generated for each phoneme by statistical methods and models. The association of the different label sequences with the total number of 60 phonemes is stored in memory 13 of the computer system.

Laboratory training is followed by speaker-dependent training. During the latter, the speaker utters a number of predetermined sentences. The uttered sentences are used by the computer system to adapt the speaker-independent label sequences obtained during laboratory training to specific speakers. Upon completion of the speaker-dependent training, memory 13 of the computer system stores a first table in which each of the 60 phonemes is related in a speaker-dependent fashion to the associated expected label sequence in the form of statistical models. This first table is shown in Fig. 5. Memory 13 of the computer system also stores a second table containing the associated phoneme sequence for each word to be recognized by the computer. This second table is shown in Fig. 6.

During speech recognition, the computer system uses the Fourier spectrum FT to generate a label sequence from the spoken text. The computer compares this label sequence with the label sequences stored in the first table (Fig. 5). From this table the computer system selects those phonemes which are most likely to match the given label sequence. The phonemes obtained are concatenated and compared with the phoneme sequences stored in the second table (Fig. 6). From this table, too, the computer system selects those words that are most likely to match the given phoneme sequences. The computer system checks the various most likely combinations of phonemes and words, so that in most cases the computer system recognizes several candidate words for the text uttered by the speaker.
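The selection of the most likely words from the second table can be sketched as a ranking of the stored phoneme sequences against the phoneme sequence obtained from the utterance. The application uses statistical models for this step; the sketch below substitutes a plain sequence-similarity ratio from Python's `difflib` as a stand-in, which is an assumption and not the method of the application. The three table entries are those of Fig. 6; the names `WORD_TABLE` and `rank_words` are hypothetical.

```python
from difflib import SequenceMatcher

# Second table (Fig. 6): word -> stored phoneme sequence.
WORD_TABLE = {
    "SCHLAG": ["z", "SCH", "L", "A", "K", "z"],
    "WORT": ["z", "V", "O", "R", "T", "z"],
    "STAERKEN": ["SCH", "T", "AE", "A1r", "K1n", "E0", "N", "x"],
}

def rank_words(phonemes, table=WORD_TABLE):
    """Return (word, score) pairs, most similar stored sequence first.

    The similarity ratio is a stand-in for the statistical likelihood.
    """
    scored = [
        (word, SequenceMatcher(None, phonemes, stored).ratio())
        for word, stored in table.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# One phoneme (Kn instead of K) differs from the stored sequence of SCHLAG.
candidates = rank_words(["z", "SCH", "L", "A", "Kn", "z"])
```

The result is a ranked list of several candidate words, which corresponds to the several candidates that are passed on to the context check described next.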

By means of a speech model, a detailed description of which has been omitted, the computer system checks several successively recognized words and decides from context which text has most likely been uttered by the speaker. The text recognized by the computer system is then displayed by display unit 17.

As has been explained above, phonemes are used to derive phonetic baseforms and to enter the words into memory 13 of the computer system. For this purpose, natural and artificial phonemes are employed. Natural phonemes are those which the speaker actually utters. Artificial phonemes on the other hand are not uttered by the speaker but are artificially generated by the computer system depending upon the function associated with them. The label sequence representing the phonemes is obtained by laboratory and speaker-dependent training, regardless of whether the phonemes concerned are natural or artificial ones.

As a natural phoneme, associated with the speech pause between two successive words, a first silent phoneme x is provided. This phoneme is contained in the table of Fig. 5 and usually occurs at the end of a phoneme sequence pertaining to a word, as illustrated for the word "staerken" in the table of Fig. 6. In addition, an artificial phoneme is provided as a second silent phoneme z, which denotes the speech pause between two successive components of a compound word. The speech pause between two components of a compound word is substantially shorter than the speech pause between two words. In an extreme case, the speech pause between the components of a word may be almost 0.

By laboratory and speaker-dependent training it is possible to train the computer system not only with respect to the natural first silent phoneme x but also with respect to the artificial second silent phoneme z denoting the speech pause between two components. To this end, the speakers utter predetermined compound words during the training phase, from which the computer system generates a label sequence for this second silent phoneme z by statistical methods and models.

If the computer system subsequently spots this label sequence in a spoken text, it is able to deduce therefrom that the phonemes occurring before and after the second silent phoneme are components of a compound word. The artificial second silent phoneme z is treated in the same way as the natural first silent phoneme x. As a result, the compound word is split into its components which are separately processed by the computer system.

The second silent phoneme z is shown in the table of Fig. 5 along with an associated label sequence and also appears in the table of Fig. 6 at the beginning and the end of the phoneme sequences of the words "Schlag" and "Wort".

Without the artificial second silent phoneme z, it would be necessary to store all compound words along with their phoneme sequences in the memory of the computer system. This means that not only the words "Schlag" and "Wort" would have to be stored but also, for instance, the words "Schlagwort", "Schlagball", "Gegenschlag", "Schlusswort", "Wortspiel", etc.

In contrast to this, by using the additional artificial second silent phoneme z, it is merely necessary to store the two words "Schlag" and "Wort" in memory 13. Whenever one of the two words appears as a component of a compound word, the compound word is split up, as previously described, by means of the second silent phoneme z and the respective component "Schlag" or "Wort" is separately recognized by the computer system.

By means of a further function, the split components are combined into a compound word. This may be done by the computer system recognizing from the context of the words that the components belong to a compound word. Alternatively, the split components may be indexed by means of the second silent phoneme z and combined accordingly later on.
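The splitting and recombination described above can be sketched as follows. The phoneme sequences for "Schlag" and "Wort" are those of Fig. 6 without the surrounding z. The exact table lookup (in place of statistical matching), the phoneme stream assumed for "Schlagwort", the naive joining rule and the names `LEXICON`, `split_components` and `recognize_compound` are assumptions made for this illustration.

```python
# Component lexicon: phoneme sequence without boundary markers -> spelling.
LEXICON = {
    ("SCH", "L", "A", "K"): "Schlag",
    ("V", "O", "R", "T"): "Wort",
}
BOUNDARY = "z"   # artificial second silent phoneme

def split_components(phonemes):
    """Split a phoneme stream at every z, dropping empty parts."""
    parts, current = [], []
    for p in phonemes:
        if p == BOUNDARY:
            if current:
                parts.append(tuple(current))
            current = []
        else:
            current.append(p)
    if current:
        parts.append(tuple(current))
    return parts

def recognize_compound(phonemes):
    """Recognize each component separately, then join the spellings.

    Assumes at least one component and that every component is stored.
    """
    words = [LEXICON[part] for part in split_components(phonemes)]
    return words[0] + "".join(w.lower() for w in words[1:])

# Assumed stream for "Schlagwort": both stored sequences sharing one z.
stream = ["z", "SCH", "L", "A", "K", "z", "V", "O", "R", "T", "z"]
compound = recognize_compound(stream)
```

Only two lexicon entries are needed to recognize the compound. Simple concatenation ignores the linking elements that many German compounds require, so a real recombination step would need more than this joining rule.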

By using the artificial second silent phoneme z, the memory space required for compound words is substantially reduced. The computer system, which for speech recognition compares the phoneme sequences of the various words stored in memory 13 with the phoneme sequences uttered by the speaker, has to carry out far fewer comparisons and associated statistical computations. As a result, the processing speed of the computer system is greatly increased.

The table of Fig. 5 shows a number of phonemes and their associated label sequences, such as the phonemes SCH, L, A, ... K, R, E, etc. The table of Fig. 5 also contains phonemes resembling them. Thus, there is, for instance, the nasalized phoneme Kn. This phoneme resembles the phoneme K which is also included and which is pronounced clearly. Another example is the phoneme Ar which primarily occurs at the end of words, as, for example, in the word "besser". This phoneme resembles the phoneme R which is also listed and which appears, for instance, in the word "Radio". All of these phonemes are natural ones.

In addition to those natural phonemes, there are artificial phonemes which in each case resemble at least two natural phonemes. Thus, for instance, an artificial phoneme K1n is provided which resembles the two natural phonemes K and Kn. In the same manner, an artificial phoneme A1r is provided which resembles the natural phonemes R and Ar. Finally, an artificial phoneme E0 is provided which resembles the natural phoneme E and which simultaneously denotes instances where the natural phoneme E is swallowed by the speaker, as, for example, in words ending with "-en".

These artificial phonemes are not actually uttered by the speaker. Rather, they denote the different ways in which the underlying sounds might be uttered by the speaker. During the laboratory and the speaker-dependent training, the speakers dictate predetermined words from which the computer system generates the label sequences of the aforementioned artificial phonemes by statistical methods and models. When a text is subsequently dictated by the speaker, the computer recognizes such label sequences and thus the associated artificial phonemes.

In the absence of such artificial phonemes, the computer system would have to store any conceivable phoneme sequence of a particular word. For the word "staerken", for example, the following phoneme sequences would have to be stored:

SCH-T-AE-R-K-E-N

SCH-T-AE-R-Kn-E-N

SCH-T-AE-R-K-N

SCH-T-AE-Ar-K-E-N

SCH-T-AE-Ar-Kn-E-N

SCH-T-AE-Ar-K-N.

When the indicated artificial phonemes are used, only one phoneme sequence has to be stored for the word "staerken" in the memory 13 of the computer system (see table illustrated in Fig. 6). In this table, the phonemes R and Ar resembling each other are replaced by the artificial phoneme A1r. Equally, the natural phonemes K and Kn resembling each other are replaced by the artificial phoneme K1n. In place of the natural phoneme E and in cases where this phoneme is suppressed, the artificial phoneme E0 is provided. In this manner, six different phoneme sequences for one and the same word are replaced by a single artificial phoneme sequence.
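That the single baseform covers the six listed sequences can be checked with a small sketch. It treats each artificial phoneme as a set of alternatives, which is a simplification: in the application the artificial phonemes are trained statistical models, not symbolic alternatives. The names `ARTIFICIAL` and `matches` are hypothetical, and the trailing silent phoneme x of Fig. 6 is left out.

```python
# Natural alternatives covered by each artificial phoneme.
ARTIFICIAL = {
    "A1r": {"R", "Ar"},
    "K1n": {"K", "Kn"},
    "E0": {"E", None},   # None marks the swallowed (omitted) E
}

def matches(baseform, uttered):
    """True if the natural sequence `uttered` is one realisation of `baseform`."""
    if not baseform:
        return not uttered
    head, rest = baseform[0], baseform[1:]
    for option in ARTIFICIAL.get(head, {head}):
        if option is None:
            # The phoneme may be left out entirely.
            if matches(rest, uttered):
                return True
        elif uttered and uttered[0] == option:
            if matches(rest, uttered[1:]):
                return True
    return False

BASEFORM = ["SCH", "T", "AE", "A1r", "K1n", "E0", "N"]
LISTED = [
    "SCH-T-AE-R-K-E-N", "SCH-T-AE-R-Kn-E-N", "SCH-T-AE-R-K-N",
    "SCH-T-AE-Ar-K-E-N", "SCH-T-AE-Ar-Kn-E-N", "SCH-T-AE-Ar-K-N",
]
all_covered = all(matches(BASEFORM, s.split("-")) for s in LISTED)
```

Because the alternatives are treated as independent here, the sketch also accepts the two combinations of Kn with an omitted E (SCH-T-AE-R-Kn-N and SCH-T-AE-Ar-Kn-N), which the description does not list.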

As a result, the memory space required for different phoneme sequences of words is substantially reduced. The computer system has to carry out far fewer operations for comparing the words dictated by the speaker with phoneme sequences of words stored in memory 13. This, in turn, increases the processing speed of the computer system.

By using artificial phonemes, the number of words to be stored in memory 13 of the computer system as well as the number of phoneme sequences are substantially reduced. This means that the number of words and phoneme sequences to be checked for speech recognition is equally reduced. As a result, the processing speed of the computer system is increased and the computer system is capable of operating in real time.

Claims

1. Computer system for speech recognition,

comprising means (10, 11, 15) for converting spoken words into phonemes,

means (13, 15) for storing words in a phonetic baseform, the phonetic baseform including a number of natural phonemes, and

means for comparing the converted and the stored phonemes,

characterized in that the phonetic baseform of the words also comprises a number of artificial phonemes.

2. Computer system as claimed in claim 1,

characterized in that

a first silent phoneme (x), associated with the speech pause between two words, is provided as a natural phoneme,

and that a second silent phoneme (z), associated with the speech pause between two components of a compound word, is provided as an artificial phoneme.

3. Computer system as claimed in claim 2,

characterized in that means for combining components of a compound word are provided.

4. Computer system as claimed in any one of the claims 1 to 3,

characterized in that an artificial phoneme (K1n, A1r, E0) is provided which resembles at least two natural phonemes (K, Kn, R, Ar, E).

5. Computer system as claimed in claim 4,

characterized in that a word or a component of a compound word, whose phonetic baseform may consist of at least two sequences of similar phonemes, is represented by the associated artificial phoneme(s).

Fig. 5

PHONEME   LABEL SEQUENCE

SCH       L157-L27-L89-L112
L         L12-L17-L97
A         L47-L172
V
O
T
AE
N
...
K
Kn
K1n
...
R
Ar
A1r
E
E0
...
x         L129-L53-L52-L76
z         L151-L129-L9



Fig. 6

WORD       PHONEME SEQUENCE

SCHLAG     z-SCH-L-A-K-z
WORT       z-V-O-R-T-z
STAERKEN   SCH-T-AE-A1r-K1n-E0-N-x